113 research outputs found

Assessing the Potential of Classical Q-learning in General Game Playing

Author: CB Browne
CJCH Watkins
CP Robert
D Silver
H Wang
J Hu
J Méhat
M Genesereth
M Świechowski
RS Sutton
V Mnih
Publication date: 14/10/2018

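After the recent groundbreaking results of AlphaGo and AlphaZero, we have seen strong interests in deep reinforcement learning and artificial general intelligence (AGI) in game playing. However, deep learning is resource-intensive and the theory is not yet well developed. For small games, simple classical table-based Q-learning might still be the algorithm of choice. General Game Playing (GGP) provides a good testbed for reinforcement learning to research AGI. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Banerjee & Stone, IJCAI 2007) in GGP. In this paper we implement Q-learning in GGP for three small-board games (Tic-Tac-Toe, Connect Four, Hex; source code: https://github.com/wh1992v/ggp-rl), to allow comparison to Banerjee et al. We find that Q-learning converges to a high win rate in GGP. For the ϵ-greedy strategy, we propose a first enhancement, the dynamic ϵ algorithm. In addition, inspired by (Gelly & Silver, ICML 2007) we combine online search (Monte Carlo Search) to enhance offline learning, and propose QM-learning for GGP. Both enhancements improve the performance of classical Q-learning. In this work, GGP allows us to show, if augmented by appropriate enhancements, that classical table-based Q-learning can perform well in small games. Comment: arXiv admin note: substantial text overlap with arXiv:1802.0594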

arXiv.org e-Print Archive

Crossref

Leiden University Scholary Publications

The Impatient May Use Limited Optimism to Minimize Regret

Author: B Aminof
C Reutenauer
CJCH Watkins
E Allender
E Filiot
F Cucker
J Filar
JY Halpern
KR Apt
L Alfaro de
LS Shapley
M Jurdzinski
ML Puterman
P Hunter
R Brenguier
U Zwick
Publication date: 17/11/2018

Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to get rewards early since later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating, then assume it is adversarial---the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks that the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE

arXiv.org e-Print Archive

Crossref

Institutional Repository Universiteit Antwerpen

DI-fusion

Answer Set Programming for Non-Stationary Markov Decision Processes

Author: C Baral
CJCH Watkins
E Even-Dar
J Babb
JY Yu
Leonardo A. Ferreira
M Balduccini
M Gelfond
M Nogueira
Paulo E. Santos
R Bellman
Ramon Lopez de Mantaras
Reinaldo A. C. Bianchi
S Zhang
V Lifschitz
Publication date: 03/05/2017

Non-stationary domains, where unforeseen changes happen, present a challenge for agents to find an optimal policy for a sequential decision making problem. This work investigates a solution to this problem that combines Markov Decision Processes (MDP) and Reinforcement Learning (RL) with Answer Set Programming (ASP) in a method we call ASP(RL). In this method, Answer Set Programming is used to find the possible trajectories of an MDP, from where Reinforcement Learning is applied to learn the optimal policy of the problem. Results show that ASP(RL) is capable of efficiently finding the optimal solution of an MDP representing non-stationary domains

arXiv.org e-Print Archive

Crossref

Digital.CSIC

Probabilistic inference for determining options in reinforcement learning

Author: Christian Daniel
Christopher M Bishop
CJCH Watkins
E Theodorou
Gerhard Neumann
Herke van Hoof
J Morimoto
Jan Peters
LE Baum
M Lagoudakis
ML Puterman
RS Sutton
TG Dietterich
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks

University of Lincoln Institutional Repository

TUbiblio

Crossref

MPG.PuRe

Application of reinforcement learning for segmentation of transrectal ultrasound images

Author: A Gelb
B Chiu
CJCH Watkins
D Shen
F Sahba
Farhang Sahba
GA Taylor
Hamid R Tizhoosh
HM Ladak
HR Tizhoosh
J Rourke
JA Noble
JS Prater
L Gong
M Shokri
Magdy MA Salama
MF Insana
N Betrounia
ND Nanayakkaral
RC Gonzalez
RJ Holt
RS Sutton
S Singh
SDV Pathak
SJ Russell
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2008

Crossref

Springer - Publisher Connector

PubMed Central

Faithful and Effective Reward Schemes for Model-Free Reinforcement Learning of Omega-Regular Objectives

Author: C Courcoubetis
CJCH Watkins
EM Hahn
F Somenzi
K Etessami
M Kwiatkowska
ME Lewis
ML Puterman
S Sickert
T Babiak
TA Henzinger
TM Liggett
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2020

University of Liverpool Repository

Crossref

University of Twente Research Information

Game theory of mind

Author: A Benveniste
A Ng
A Traulsen
AN Hampton
B Skyrms
CF Camerer
CJCH Watkins
D Fudenberg
D Kahneman
D Wilson
DG Premack
DM Kreps
DO Stahl
E Fehr
E Todorov
H Gintis
HA Simon
HL Gallagher
J Moll
JM Smith
K McCabe
Karl J. Friston
KJ Friston
M Costa-Gomes
P Davies
P Milgrom
PA Haile
PJ Gmytrasiewicz
R Bellman
R McKelvey
Ray J. Dolan
RS Sutton
S Avner
Tim Behrens
U Frith
W Nelson
Wako Yoshida
Publication date: 01/01/2008

This paper introduces a model of ‘theory of mind’, namely, how we represent the intentions and goals of others to optimise our mutual interactions. We draw on ideas from optimum control and game theory to provide a ‘game theory of mind’. First, we consider the representations of goals in terms of value functions that are prescribed by utility or rewards. Critically, the joint value functions and ensuing behaviour are optimised recursively, under the assumption that I represent your value function, your representation of mine, your representation of my representation of yours, and so on ad infinitum. However, if we assume that the degree of recursion is bounded, then players need to estimate the opponent's degree of recursion (i.e., sophistication) to respond optimally. This induces a problem of inferring the opponent's sophistication, given behavioural exchanges. We show it is possible to deduce whether players make inferences about each other and quantify their sophistication on the basis of choices in sequential games. This rests on comparing generative models of choices with, and without, inference. Model comparison is demonstrated using simulated and real data from a ‘stag-hunt’. Finally, we note that exactly the same sophisticated behaviour can be achieved by optimising the utility function itself (through prosocial utility), producing unsophisticated but apparently altruistic agents. This may be relevant ethologically in hierarchal game theory and coevolution

CiteSeerX

Crossref

Directory of Open Access Journals

UCL Discovery

PubMed Central

MPG.PuRe

Assessing the Potential of Classical Q-learning in General Game Playing

Author: CB Browne
CJCH Watkins
CP Robert
D Silver
H Wang
J Hu
J Méhat
M Genesereth
M Świechowski
RS Sutton
V Mnih
Publication venue: Springer Science and Business Media LLC
Publication date: 25/09/2019

After the recent groundbreaking results of AlphaGo and AlphaZero, we have seen strong interests in deep reinforcement learning and artificial general intelligence (AGI) in game playing. However, deep learning is resource-intensive and the theory is not yet well developed. For small games, simple classical table-based Q-learning might still be the algorithm of choice. General Game Playing (GGP) provides a good testbed for reinforcement learning to research AGI. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Banerjee & Stone, IJCAI 2007) in GGP. In this paper we implement Q-learning in GGP for three small-board games (Tic-Tac-Toe, Connect Four, Hex), to allow comparison to Banerjee et al. We find that Q-learning converges to a high win rate in GGP. For the ϵ-greedy strategy, we propose a first enhancement, the dynamic ϵ algorithm. In addition, inspired by (Gelly & Silver, ICML 2007) we combine online search (Monte Carlo Search) to enhance offline learning, and propose QM-learning for GGP. Both enhancements improve the performance of classical Q-learning. In this work, GGP allows us to show, if augmented by appropriate enhancements, that classical table-based Q-learning can perform well in small games. Computer Systems, Imagery and Medi

Crossref

Leiden University Scholary Publications

Learning and innovative elements of strategy adoption rules expand cooperative network topologies

Author: A Feigel
A Szolnoki
A Traulsen
AL Barabasi
B Skyrms
C Hauert
Changshui Zhang
CJCH Watkins
CL Tang
DJ Watts
Enrico Scalas
F Fu
FC Santos
G Szabó
H Ebel
H Ohtsuki
I Derenyi
IA Kovacs
J Leskovec
J Vukov
JM McNamara
JM Pacheco
L Luthi
M Girvan
M Granovetter
M Kirschner
M Tan
M Tomassini
MA Nowak
MD Cohen
MW Macy
Máté S. Szalay
N Masuda
P Csermely
P Holme
Peter Csermely
R Axelrod
R Durrett
RE Michod
RJ Aumann
RS Sutton
S Goyal
Shijun Wang
TW Sandholm
V Batagelj
Z Rong
Publication venue: Public Library of Science (PLoS)
Publication date: 09/04/2008

Cooperation plays a key role in the evolution of complex systems. However, the level of cooperation extensively varies with the topology of agent networks in the widely used models of repeated games. Here we show that cooperation remains rather stable by applying the reinforcement learning strategy adoption rule, Q-learning, on a variety of random, regular, small-world, scale-free and modular network models in repeated, multi-agent Prisoner's Dilemma and Hawk-Dove games. Furthermore, we found that using the above model systems other long-term learning strategy adoption rules also promote cooperation, while introducing a low level of noise (as a model of innovation) to the strategy adoption rules makes the level of cooperation less dependent on the actual network topology. Our results demonstrate that long-term learning and random elements in the strategy adoption rules, when acting together, extend the range of network topologies enabling the development of cooperation at a wider range of costs and temptations. These results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems. Comment: 14 pages, 3 Figures + a Supplementary Material with 25 pages, 3 Tables, 12 Figures and 116 reference

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

PubMed Central

Reinforcement learning or active inference?

This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals